LINEAR REGRESSION MODEL

Since the target variable is numeric, we use a linear regression model.

PROBLEM STATEMENT

  1. An insurance company wants to predict customer lifetime value (CLV) from the qualitative and quantitative features provided.

  2. They have been operating for the last few years and maintain all transactional information. The given data 'CustomerData.csv' is a sample of customer-level data extracted and processed for this analysis from various transactional files.

OBJECTIVE

  1. The client wants to predict customer lifetime value by building a linear regression model.

STEPS TO CALCULATE CUSTOMER LIFETIME VALUE

  1. Calculate average purchase value

  2. Calculate average purchase frequency rate

  3. Calculate customer value

  4. Calculate average customer lifespan

  5. Calculate CLTV
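The steps above can be sketched as a small calculation. All figures below are illustrative assumptions, not values taken from the CustomerData sample:

```python
# Illustrative, assumed figures (not from the dataset)
total_revenue = 50_000.0   # revenue over the period
n_purchases   = 1_000      # purchases over the period
n_customers   = 200        # distinct customers

avg_purchase_value     = total_revenue / n_purchases                   # step 1
avg_purchase_frequency = n_purchases / n_customers                     # step 2
customer_value         = avg_purchase_value * avg_purchase_frequency   # step 3
avg_customer_lifespan  = 4.0                                           # step 4, in years (assumed)
cltv                   = customer_value * avg_customer_lifespan        # step 5

print(cltv)  # 1000.0
```

The regression model later replaces this hand calculation: it learns CLTV directly from the customer attributes instead of aggregating transactions.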

HOW MACHINE LEARNING CAN HELP IMPROVE CLV

  1. A quality management team within an automaker operates in a challenging field of tension between customer satisfaction, regulation and cost control.

  2. A predictive model can be built to automate the customer lifetime value process, hence reducing unfairness in the CLV estimates.

LOADING THE REQUIRED LIBRARIES

In [13]:
#### Load the required libraries
%matplotlib inline

# install once if the packages are missing (no-ops when already present)
!pip install mlxtend
!pip install networkx

import os
import math
import random
import inspect

import numpy as np                  # linear algebra
import pandas as pd                 # dataframes
import matplotlib.pyplot as plt
import seaborn as sns               # visualization
import missingno as msno            # missing-value visualization
import pandas_profiling as pp       # automated EDA reports
import networkx as nx
import graphviz
from IPython.display import SVG, display

sns.set()
sns.set_palette("GnBu_d")
sns.set_style('whitegrid')

from sklearn.model_selection import train_test_split, GridSearchCV   # splitting the data
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer                             # handling NA values
from sklearn.preprocessing import StandardScaler, LabelEncoder       # scaling / encoding
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_graphviz
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             confusion_matrix, classification_report,
                             mean_absolute_error, mean_squared_error, r2_score)

# The results later hint at strong multicollinearity in the data; the variance
# inflation factor (VIF) will be used to confirm it and remove those attributes.
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.outliers_influence import variance_inflation_factor

# mlxtend: machine learning extensions (association-rule mining)
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

enc = LabelEncoder()
Requirement already satisfied: mlxtend in c:\users\smithika\anaconda3\lib\site-packages (0.17.0)
Requirement already satisfied: networkx in c:\users\smithika\anaconda3\lib\site-packages (2.3)
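The VIF check mentioned in the import comments boils down to regressing each numeric column on all the others: VIF_j = 1 / (1 - R²_j), and values well above 5-10 flag multicollinearity. A minimal NumPy-only sketch on synthetic data (the `variance_inflation_factor` call from statsmodels used later performs the same computation):

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing X[:, j] on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add an intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.05, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])
# x1 and x2 get very large VIFs; the independent x3 stays near 1
```

In the notebook itself the same idea is applied to the numeric columns of the customer frame via statsmodels.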
In [14]:
# pip install run from inside the notebook, in case graphviz or mlxtend are not already available
! pip install graphviz
! pip install mlxtend
Requirement already satisfied: graphviz in c:\users\smithika\anaconda3\lib\site-packages (0.13.2)
Requirement already satisfied: mlxtend in c:\users\smithika\anaconda3\lib\site-packages (0.17.0)
In [15]:
! pip install missingno
! pip install pandas_profiling
Requirement already satisfied: missingno in c:\users\smithika\anaconda3\lib\site-packages (0.4.2)
Requirement already satisfied: pandas_profiling in c:\users\smithika\anaconda3\lib\site-packages (2.3.0)

LOADING THE DATA

1. Set the current working directory for this Jupyter notebook

In [16]:
# check which directory this notebook is currently working in
os.getcwd()
In [17]:
# chdir changes the working directory to the folder that holds the data
os.chdir('C:\\Users\\smithika\\Desktop\\final mith')
In [18]:
# loading the data
df= pd.read_csv("train-1574429526318.csv")
In [19]:
# inspect the full loaded dataset
df
Out[19]:
CustomerID Customer.Lifetime.Value Coverage Education EmploymentStatus Gender Income Location.Geo Location.Code Marital.Status ... Months.Since.Policy.Inception Number.of.Open.Complaints Number.of.Policies Policy.Type Policy Renew.Offer.Type Sales.Channel Total.Claim.Amount Vehicle.Class Vehicle.Size
0 5917 7824.372789 Basic Bachelor Unemployed F 0 17.7,77.7 Urban Married ... 33 NaN 2.0 Personal Auto Personal L2 Offer2 Branch 267.214383 Four-Door Car 2.0
1 2057 8005.964669 Basic College Employed M 63357 28.8,76.6 Suburban Married ... 42 0.0 5.0 Personal Auto Personal L2 Offer2 Agent 565.508572 SUV 2.0
2 4119 8646.504109 Basic High School or Below Employed F 64125 21.6,88.4 Urban Married ... 44 0.0 3.0 Personal Auto Personal L1 Offer2 Branch 369.818708 SUV 1.0
3 1801 9294.088719 Basic College Employed M 67544 19,72.5 Suburban Married ... 15 NaN 3.0 Corporate Auto Corporate L3 Offer1 Branch 556.800000 SUV 3.0
4 9618 5595.971365 Basic Bachelor Retired F 19651 19.1,74.7 Suburban Married ... 68 0.0 5.0 Personal Auto Personal L1 Offer2 Web 345.600000 Two-Door Car 3.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9801 3735 20496.694260 Basic High School or Below Unemployed F 0 12.7,79.4 Suburban Single ... 72 0.0 2.0 Personal Auto Personal L2 Offer1 Branch 307.200000 Four-Door Car 2.0
9802 5988 2592.437797 Basic High School or Below Employed M 72421 18.6,72.3 Suburban Married ... 23 0.0 1.0 Corporate Auto Corporate L3 Offer2 Call Center 312.000000 Four-Door Car 3.0
9803 8767 3103.923041 Extended College Employed F 74665 19.2,74.7 Urban Married ... 90 2.0 1.0 Corporate Auto Corporate L2 Offer2 Call Center 236.902001 Four-Door Car 2.0
9804 9900 9161.655119 Basic High School or Below Employed F 91763 19.5,73.9 Urban Married ... 64 0.0 3.0 Special Auto Special L3 Offer1 Call Center 441.992043 SUV 3.0
9805 11323 8583.272854 Premium High School or Below Disabled F 18017 17.2,78.2 Suburban Divorced ... 54 0.0 9.0 Personal Auto Personal L3 Offer2 Call Center 547.200000 Four-Door Car 2.0

9806 rows × 22 columns

DATA UNDERSTANDING

In [20]:
# df.head() shows the first rows, making the attributes easy to inspect
df.head(5)
Out[20]:
CustomerID Customer.Lifetime.Value Coverage Education EmploymentStatus Gender Income Location.Geo Location.Code Marital.Status ... Months.Since.Policy.Inception Number.of.Open.Complaints Number.of.Policies Policy.Type Policy Renew.Offer.Type Sales.Channel Total.Claim.Amount Vehicle.Class Vehicle.Size
0 5917 7824.372789 Basic Bachelor Unemployed F 0 17.7,77.7 Urban Married ... 33 NaN 2.0 Personal Auto Personal L2 Offer2 Branch 267.214383 Four-Door Car 2.0
1 2057 8005.964669 Basic College Employed M 63357 28.8,76.6 Suburban Married ... 42 0.0 5.0 Personal Auto Personal L2 Offer2 Agent 565.508572 SUV 2.0
2 4119 8646.504109 Basic High School or Below Employed F 64125 21.6,88.4 Urban Married ... 44 0.0 3.0 Personal Auto Personal L1 Offer2 Branch 369.818708 SUV 1.0
3 1801 9294.088719 Basic College Employed M 67544 19,72.5 Suburban Married ... 15 NaN 3.0 Corporate Auto Corporate L3 Offer1 Branch 556.800000 SUV 3.0
4 9618 5595.971365 Basic Bachelor Retired F 19651 19.1,74.7 Suburban Married ... 68 0.0 5.0 Personal Auto Personal L1 Offer2 Web 345.600000 Two-Door Car 3.0

5 rows × 22 columns

In [21]:
## to check the type of the data
type(df)
Out[21]:
pandas.core.frame.DataFrame
In [22]:
# to check the shape of the data
df.shape
Out[22]:
(9806, 22)

SUMMARY STATISTICS

In [23]:
# summary statistics of df; include='all' covers the categorical columns as well
df.describe(include='all')
Out[23]:
CustomerID Customer.Lifetime.Value Coverage Education EmploymentStatus Gender Income Location.Geo Location.Code Marital.Status ... Months.Since.Policy.Inception Number.of.Open.Complaints Number.of.Policies Policy.Type Policy Renew.Offer.Type Sales.Channel Total.Claim.Amount Vehicle.Class Vehicle.Size
count 9806.000000 9806.000000 8881 9677 9688 9677 9806 9806 9687 9677 ... 9806.000000 8988.000000 9685.000000 8915 9685 9678 9678 9806.000000 9680 9680.000000
unique NaN NaN 3 5 5 2 4622 2840 3 3 ... NaN NaN NaN 3 9 4 4 NaN 6 NaN
top NaN NaN Basic Bachelor Employed F 0 NA,NA Suburban Married ... NaN NaN NaN Personal Auto Personal L3 Offer1 Agent NaN Four-Door Car NaN
freq NaN NaN 5361 2934 6020 4985 2461 119 6204 5643 ... NaN NaN NaN 6620 3637 3975 3670 NaN 4869 NaN
mean 5778.381807 7998.047015 NaN NaN NaN NaN NaN NaN NaN NaN ... 48.165001 0.379172 2.960351 NaN NaN NaN NaN 438.266734 NaN 2.089773
std 3343.286093 6848.055899 NaN NaN NaN NaN NaN NaN NaN NaN ... 27.963630 0.896427 2.389801 NaN NaN NaN NaN 293.502301 NaN 0.538524
min 1.000000 1898.007675 NaN NaN NaN NaN NaN NaN NaN NaN ... 0.000000 0.000000 1.000000 NaN NaN NaN NaN 0.099007 NaN 1.000000
25% 2879.250000 4013.949039 NaN NaN NaN NaN NaN NaN NaN NaN ... 24.000000 0.000000 1.000000 NaN NaN NaN NaN 280.352767 NaN 2.000000
50% 5783.000000 5780.182197 NaN NaN NaN NaN NaN NaN NaN NaN ... 48.000000 0.000000 2.000000 NaN NaN NaN NaN 384.007015 NaN 2.000000
75% 8678.750000 8960.280213 NaN NaN NaN NaN NaN NaN NaN NaN ... 71.750000 0.000000 4.000000 NaN NaN NaN NaN 553.540973 NaN 2.000000
max 11573.000000 83325.381190 NaN NaN NaN NaN NaN NaN NaN NaN ... 99.000000 5.000000 9.000000 NaN NaN NaN NaN 2893.239678 NaN 3.000000

11 rows × 22 columns

In [24]:
# customers with the maximum number of open complaints
# (original column names: the lower-casing step only runs in a later cell)
df.loc[df['Number.of.Open.Complaints'] == 5.0, :]
In [25]:
# customers with the maximum number of policies
df.loc[df['Number.of.Policies'] == 9.0, :]
In [26]:
# convert all column names to lower case and replace spaces with underscores for convenience
df.columns = [x.lower().replace(' ','_') for x in df.columns]
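The comprehension above applies `str.lower` plus a space-to-underscore replace to every column name; a quick check on illustrative names (these exact strings are examples, not all from the dataset):

```python
cols = ['Customer.Lifetime.Value', 'Marital Status', 'EmploymentStatus']
cleaned = [x.lower().replace(' ', '_') for x in cols]
print(cleaned)  # ['customer.lifetime.value', 'marital_status', 'employmentstatus']
```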
In [27]:
## to check the index of the df
df.index
Out[27]:
RangeIndex(start=0, stop=9806, step=1)
In [28]:
# verify that the column names have been converted
df.head()
Out[28]:
customerid customer.lifetime.value coverage education employmentstatus gender income location.geo location.code marital.status ... months.since.policy.inception number.of.open.complaints number.of.policies policy.type policy renew.offer.type sales.channel total.claim.amount vehicle.class vehicle.size
0 5917 7824.372789 Basic Bachelor Unemployed F 0 17.7,77.7 Urban Married ... 33 NaN 2.0 Personal Auto Personal L2 Offer2 Branch 267.214383 Four-Door Car 2.0
1 2057 8005.964669 Basic College Employed M 63357 28.8,76.6 Suburban Married ... 42 0.0 5.0 Personal Auto Personal L2 Offer2 Agent 565.508572 SUV 2.0
2 4119 8646.504109 Basic High School or Below Employed F 64125 21.6,88.4 Urban Married ... 44 0.0 3.0 Personal Auto Personal L1 Offer2 Branch 369.818708 SUV 1.0
3 1801 9294.088719 Basic College Employed M 67544 19,72.5 Suburban Married ... 15 NaN 3.0 Corporate Auto Corporate L3 Offer1 Branch 556.800000 SUV 3.0
4 9618 5595.971365 Basic Bachelor Retired F 19651 19.1,74.7 Suburban Married ... 68 0.0 5.0 Personal Auto Personal L1 Offer2 Web 345.600000 Two-Door Car 3.0

5 rows × 22 columns

In [29]:
# a helper that reports the dtype, unique values and null counts for each column
def levels(df):
    return pd.DataFrame({'dtype': df.dtypes,
                         'levels': [df[x].unique() for x in df.columns],
                         'null_values': df.isna().sum(),
                         'unique': df.nunique()})
levels(df)
Out[29]:
dtype levels null_values unique
customerid int64 [5917, 2057, 4119, 1801, 9618, 2747, 3633, 385... 0 9806
customer.lifetime.value float64 [7824.372789, 8005.964669, 8646.504109, 9294.0... 0 6477
coverage object [Basic, Extended, nan, Premium] 925 3
education object [Bachelor, College, High School or Below, Doct... 129 5
employmentstatus object [Unemployed, Employed, Retired, Medical Leave,... 118 5
gender object [F, M, nan] 129 2
income object [0, 63357, 64125, 67544, 19651, 23589, 74126, ... 0 4622
location.geo object [17.7,77.7, 28.8,76.6, 21.6,88.4, 19,72.5, 19.... 0 2840
location.code object [Urban, Suburban, Rural, nan] 119 3
marital.status object [Married, Divorced, Single, nan] 129 3
monthly.premium.auto float64 [67.0, 101.0, 108.0, 116.0, 72.0, 211.0, 90.0,... 794 191
months.since.last.claim int64 [2, 26, 3, 30, 14, 10, 6, 17, 13, 4, 8, 16, 29... 0 36
months.since.policy.inception int64 [33, 42, 44, 15, 68, 13, 76, 50, 62, 41, 21, 5... 0 100
number.of.open.complaints float64 [nan, 0.0, 1.0, 2.0, 3.0, 5.0, 4.0] 818 6
number.of.policies float64 [2.0, 5.0, 3.0, 6.0, 1.0, 9.0, 8.0, nan, 7.0, ... 121 9
policy.type object [Personal Auto, Corporate Auto, Special Auto, ... 891 3
policy object [Personal L2, Personal L1, Corporate L3, Speci... 121 9
renew.offer.type object [Offer2, Offer1, Offer3, nan, Offer4] 128 4
sales.channel object [Branch, Agent, Web, Call Center, nan] 128 4
total.claim.amount float64 [267.214383, 565.508572, 369.818708, 556.8, 34... 0 4125
vehicle.class object [Four-Door Car, SUV, Two-Door Car, Luxury SUV,... 126 6
vehicle.size float64 [2.0, 1.0, 3.0, nan] 126 3

Observations:

  1. we have three different data types: object (categorical), int64, and float64
  2. we cannot build a model on object columns until they are correctly framed with the required transformations
  3. summary statistics for customerid and location.geo are not meaningful, hence we drop those columns
  4. now we can segregate the categorical and numerical attributes separately
In [30]:
df.drop("location.geo",axis=1,inplace=True)
df.drop("customerid",axis=1,inplace=True)
In [31]:
df.head(3)
Out[31]:
customer.lifetime.value coverage education employmentstatus gender income location.code marital.status monthly.premium.auto months.since.last.claim months.since.policy.inception number.of.open.complaints number.of.policies policy.type policy renew.offer.type sales.channel total.claim.amount vehicle.class vehicle.size
0 7824.372789 Basic Bachelor Unemployed F 0 Urban Married 67.0 2 33 NaN 2.0 Personal Auto Personal L2 Offer2 Branch 267.214383 Four-Door Car 2.0
1 8005.964669 Basic College Employed M 63357 Suburban Married 101.0 26 42 0.0 5.0 Personal Auto Personal L2 Offer2 Agent 565.508572 SUV 2.0
2 8646.504109 Basic High School or Below Employed F 64125 Urban Married 108.0 3 44 0.0 3.0 Personal Auto Personal L1 Offer2 Branch 369.818708 SUV 1.0

missing values.

In [32]:
## checking the total count of missing values per column
df.isna().sum()
Out[32]:
customer.lifetime.value            0
coverage                         925
education                        129
employmentstatus                 118
gender                           129
income                             0
location.code                    119
marital.status                   129
monthly.premium.auto             794
months.since.last.claim            0
months.since.policy.inception      0
number.of.open.complaints        818
number.of.policies               121
policy.type                      891
policy                           121
renew.offer.type                 128
sales.channel                    128
total.claim.amount                 0
vehicle.class                    126
vehicle.size                     126
dtype: int64
In [33]:
# dropping rows with missing values so the numeric attributes can be plotted easily
df.dropna(axis=0, inplace=True) 
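dropping rows with any missing value shrinks the data from 9806 to 6947 rows (about 29% lost). an alternative is imputation with SimpleImputer (already imported above). a minimal sketch, assuming median for numerics and mode for categoricals -- the toy frame and its values here are hypothetical stand-ins for the real columns:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy frame standing in for the insurance data (hypothetical values)
toy = pd.DataFrame({
    "monthly.premium.auto": [67.0, np.nan, 108.0, 116.0],
    "coverage": ["Basic", "Extended", np.nan, "Basic"],
})

# median for numeric columns, most frequent level for categoricals
num_imp = SimpleImputer(strategy="median")
cat_imp = SimpleImputer(strategy="most_frequent")

toy["monthly.premium.auto"] = num_imp.fit_transform(toy[["monthly.premium.auto"]]).ravel()
toy["coverage"] = cat_imp.fit_transform(toy[["coverage"]]).ravel()

print(toy.isna().sum().sum())  # no missing values remain
```

this keeps all rows at the cost of slightly biasing the filled-in columns toward their central values.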
In [34]:
# checking if we have any duplicate rows
df2=df.duplicated(keep='first')
In [35]:
df2
Out[35]:
1       False
2       False
4       False
6       False
7       False
        ...  
9801    False
9802    False
9803    False
9804    False
9805    False
Length: 6947, dtype: bool
In [36]:
print(df.shape)
(6947, 20)
In [37]:
## to check the numeric attributes in my df
num_attr = df.select_dtypes(include=['int64', 'float64']).columns
num_attr
Out[37]:
Index(['customer.lifetime.value', 'monthly.premium.auto',
       'months.since.last.claim', 'months.since.policy.inception',
       'number.of.open.complaints', 'number.of.policies', 'total.claim.amount',
       'vehicle.size'],
      dtype='object')
In [38]:
## to check which categorical (object) attributes are in df
num_cat= df.select_dtypes(include=['object']).columns
num_cat
Out[38]:
Index(['coverage', 'education', 'employmentstatus', 'gender', 'income',
       'location.code', 'marital.status', 'policy.type', 'policy',
       'renew.offer.type', 'sales.channel', 'vehicle.class'],
      dtype='object')
In [39]:
df[num_attr].corr()
Out[39]:
customer.lifetime.value monthly.premium.auto months.since.last.claim months.since.policy.inception number.of.open.complaints number.of.policies total.claim.amount vehicle.size
customer.lifetime.value 1.000000 0.392827 0.004085 0.001565 -0.042126 0.020455 0.219363 0.013654
monthly.premium.auto 0.392827 1.000000 0.011426 0.024690 -0.022610 -0.018534 0.645253 0.003234
months.since.last.claim 0.004085 0.011426 1.000000 -0.030475 0.019958 0.008672 0.013137 -0.003668
months.since.policy.inception 0.001565 0.024690 -0.030475 1.000000 -0.014129 0.005204 0.006255 0.007495
number.of.open.complaints -0.042126 -0.022610 0.019958 -0.014129 1.000000 -0.003843 -0.019976 -0.014940
number.of.policies 0.020455 -0.018534 0.008672 0.005204 -0.003843 1.000000 -0.014460 0.022783
total.claim.amount 0.219363 0.645253 0.013137 0.006255 -0.019976 -0.014460 1.000000 0.069422
vehicle.size 0.013654 0.003234 -0.003668 0.007495 -0.014940 0.022783 0.069422 1.000000
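the matrix is easier to act on when predictors are ranked by absolute correlation with the target (here monthly.premium.auto at 0.39 and total.claim.amount at 0.22 stand out). a minimal sketch on a small stand-in frame -- the values below are hypothetical, not the real data:

```python
import pandas as pd

# hypothetical frame standing in for df[num_attr]; real values differ
df = pd.DataFrame({
    "customer.lifetime.value": [7824.4, 8006.0, 8646.5, 9294.1, 5596.0],
    "monthly.premium.auto":    [67.0, 101.0, 108.0, 116.0, 72.0],
    "total.claim.amount":      [267.2, 565.5, 369.8, 556.8, 345.6],
})

# rank predictors by |correlation| with the target, strongest first
target = "customer.lifetime.value"
ranked = (df.corr()[target]
            .drop(target)
            .abs()
            .sort_values(ascending=False))
print(ranked)
```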

scatter plot matrix for the numeric attributes.

In [40]:
pd.plotting.scatter_matrix (df, figsize=(16, 16), diagonal='kde')
plt.show()
In [41]:
plt.figure(figsize=(8,8))
sns.heatmap(df.corr())
plt.show()

exploratory data analysis

In [42]:
###Plotting Categorical Data
sns.countplot(x="vehicle.class", data=df)
plt.show()

observations

  1. Four-Door Cars are by far the most common vehicle class among customers, followed by SUVs and Two-Door Cars.
  2. this tells us which vehicle classes dominate the customer base.

conversion of object attributes & dummification

In [43]:
# candidate categorical columns (location.geo was already dropped in In [30])
cat=('coverage', 'education', 'employmentstatus', 'gender', 'income',
     'location.code', 'marital.status', 'policy.type',
     'policy', 'renew.offer.type', 'sales.channel', 'vehicle.class')
In [44]:
cat
Out[44]:
('coverage',
 'education',
 'employmentstatus',
 'gender',
 'income',
 'location.code',
 'marital.status',
 'policy.type',
 'policy',
 'renew.offer.type',
 'sales.channel',
 'vehicle.class')
In [45]:
for col in ['coverage','education','employmentstatus','gender','income',
 'location.code',
 'marital.status',
 'policy.type',
 'policy',
 'renew.offer.type',
 'sales.channel',
 'vehicle.class']:
    df[col] = df[col].astype('category')
In [46]:
df[col]
Out[46]:
1                 SUV
2                 SUV
4        Two-Door Car
6        Two-Door Car
7       Four-Door Car
            ...      
9801    Four-Door Car
9802    Four-Door Car
9803    Four-Door Car
9804              SUV
9805    Four-Door Car
Name: vehicle.class, Length: 6947, dtype: category
Categories (6, object): [Four-Door Car, Luxury Car, Luxury SUV, SUV, Sports Car, Two-Door Car]
In [47]:
cols=["coverage", "education","employmentstatus","gender","income",
 "marital.status",
 "policy.type",
 "policy",
 "renew.offer.type",
 "sales.channel",
 "vehicle.class"]
df=pd.get_dummies(columns=cols,data=df,drop_first=True)
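drop_first=True drops one level per categorical to avoid the dummy-variable trap: keeping all levels makes the dummies sum to 1 and become perfectly collinear with the intercept. a minimal sketch on a hypothetical three-level column:

```python
import pandas as pd

# hypothetical three-level column, like 'coverage'
toy = pd.DataFrame({"coverage": ["Basic", "Extended", "Premium", "Basic"]})

full = pd.get_dummies(toy, columns=["coverage"])                      # 3 dummy columns
reduced = pd.get_dummies(toy, columns=["coverage"], drop_first=True)  # 2 dummy columns

print(full.shape[1], reduced.shape[1])  # 3 2
# the dropped level ('Basic') is encoded implicitly as all-zeros
```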
In [48]:
df.head(3)
Out[48]:
customer.lifetime.value location.code monthly.premium.auto months.since.last.claim months.since.policy.inception number.of.open.complaints number.of.policies total.claim.amount vehicle.size coverage_Extended ... renew.offer.type_Offer3 renew.offer.type_Offer4 sales.channel_Branch sales.channel_Call Center sales.channel_Web vehicle.class_Luxury Car vehicle.class_Luxury SUV vehicle.class_SUV vehicle.class_Sports Car vehicle.class_Two-Door Car
1 8005.964669 Suburban 101.0 26 42 0.0 5.0 565.508572 2.0 0 ... 0 0 0 0 0 0 0 1 0 0
2 8646.504109 Urban 108.0 3 44 0.0 3.0 369.818708 1.0 0 ... 0 0 1 0 0 0 0 1 0 0
4 5595.971365 Suburban 72.0 3 68 0.0 5.0 345.600000 3.0 0 ... 0 0 0 0 1 0 0 0 0 1

3 rows × 3980 columns

In [49]:
df.isna().sum()
Out[49]:
customer.lifetime.value          0
location.code                    0
monthly.premium.auto             0
months.since.last.claim          0
months.since.policy.inception    0
                                ..
vehicle.class_Luxury Car         0
vehicle.class_Luxury SUV         0
vehicle.class_SUV                0
vehicle.class_Sports Car         0
vehicle.class_Two-Door Car       0
Length: 3980, dtype: int64

even after the earlier dropna, a few attributes (location.code, monthly.premium.auto) still contained missing values, so those rows were dropped as well. note that dummifying income (an object column with 4622 levels) is what inflates the frame to 3980 columns; converting income to numeric before encoding would avoid this.

In [50]:
df.dropna(axis=0, inplace=True)
In [51]:
df.isna().sum()
Out[51]:
customer.lifetime.value          0
location.code                    0
monthly.premium.auto             0
months.since.last.claim          0
months.since.policy.inception    0
                                ..
vehicle.class_Luxury Car         0
vehicle.class_Luxury SUV         0
vehicle.class_SUV                0
vehicle.class_Sports Car         0
vehicle.class_Two-Door Car       0
Length: 3980, dtype: int64
In [52]:
df.dtypes
Out[52]:
customer.lifetime.value           float64
location.code                    category
monthly.premium.auto              float64
months.since.last.claim             int64
months.since.policy.inception       int64
                                   ...   
vehicle.class_Luxury Car            uint8
vehicle.class_Luxury SUV            uint8
vehicle.class_SUV                   uint8
vehicle.class_Sports Car            uint8
vehicle.class_Two-Door Car          uint8
Length: 3980, dtype: object

Plotting Numeric VS Numeric Data

The seaborn function lmplot combines regplot() and FacetGrid.

It is intended as a convenient interface to fit regression models across conditional subsets of a dataset.

sns.lmplot(x='total.claim.amount', y='customer.lifetime.value', data=df)
plt.show()

showing the relationship between the target variable and different predictors

In [53]:
# value counts of vehicle.class -- this raises a KeyError because
# vehicle.class was replaced by its dummy columns in In [47]
df['vehicle.class'].value_counts()
---------------------------------------------------------------------------
KeyError: 'vehicle.class'
(full traceback omitted; the column no longer exists after pd.get_dummies)
In [54]:
df['vehicle.size'].value_counts()
Out[54]:
2.0    4846
3.0    1373
1.0     728
Name: vehicle.size, dtype: int64
In [68]:
sns.countplot(x="vehicle.size", data=df)
plt.show()
In [69]:
df['renew.offer.type'].value_counts()
KeyError: 'renew.offer.type' (traceback omitted)
In [70]:
df['sales.channel'].value_counts()
KeyError: 'sales.channel' (traceback omitted)
In [71]:
df['policy'].value_counts()
KeyError: 'policy' (traceback omitted)
In [72]:
df['employmentstatus'].value_counts()
KeyError: 'employmentstatus' (traceback omitted)
In [73]:
df['policy.type'].value_counts()
KeyError: 'policy.type' (traceback omitted)
In [74]:
sns.boxplot(x="vehicle.class", y="customer.lifetime.value", data=df, palette="PRGn")
plt.show()
ValueError: Could not interpret input 'vehicle.class' (traceback omitted)
In [75]:
sns.boxplot(x="employmentstatus", y="customer.lifetime.value", data=df, palette="PRGn")
plt.show()
ValueError: Could not interpret input 'employmentstatus' (traceback omitted)

all of these cells fail for the same reason: the raw categorical columns were replaced by dummy columns in In [47]. they should be run before pd.get_dummies, or rewritten against the dummified column names.
In [76]:
# countplot of the target -- with 5482 unique values this is not very informative; the histograms below are better suited
sns.countplot(x="customer.lifetime.value", data=df)
plt.show()
In [77]:
# histogram of total.claim.amount
sns.distplot(df["total.claim.amount"])
# the number of bins can be changed via the bins parameter
plt.show()
In [78]:
# histogram of the target, customer.lifetime.value
sns.distplot(df["customer.lifetime.value"])
# the number of bins can be changed via the bins parameter
plt.show()
In [79]:
df['customer.lifetime.value'].value_counts()
Out[79]:
10656.881950    9
8382.630118     9
22332.439460    8
7285.030983     8
4270.034394     8
               ..
7716.439345     1
25093.570370    1
2064.458781     1
2939.939038     1
3319.620511     1
Name: customer.lifetime.value, Length: 5482, dtype: int64
In [80]:
# regression plot of total claim amount vs customer lifetime value
sns.lmplot(x='total.claim.amount', y='customer.lifetime.value', data=df)
plt.show()

splitting the data into train and test sets

In [81]:
#Performing train test split on the data
X, y = df.loc[:,df.columns!='customer.lifetime.value'], df.loc[:,'customer.lifetime.value']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
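this chunk splits the data but never fits the model itself. a minimal sketch of the next step, run here on synthetic data shaped like X_train/y_train (the feature values, coefficients, and noise level are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(123)

# synthetic stand-in for the dummified feature matrix and CLV target
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) * 1000 + 8000 + rng.normal(scale=500, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

# fit ordinary least squares and evaluate on the held-out split
lr = LinearRegression().fit(X_train, y_train)
pred = lr.predict(X_test)

rmse = mean_squared_error(y_test, pred) ** 0.5
print(f"R2={r2_score(y_test, pred):.3f}  RMSE={rmse:.1f}")
```

on the real data the same two lines (fit, predict) apply directly to the X_train/y_train produced above.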
In [82]:
# distribution of the target in the train and test splits (with a continuous target most counts are 1)
print(pd.value_counts(y_train))
print(pd.value_counts(y_test))
8382.630118     9
19160.989940    7
4626.801093     7
13197.928930    6
5639.941974     6
               ..
3240.667531     1
10633.504960    1
8447.801005     1
9476.901955     1
3819.619502     1
Name: customer.lifetime.value, Length: 4039, dtype: int64
5568.947534     5
10656.881950    5
5498.940679     4
12157.329920    4
3919.366722     4
               ..
7524.735746     1
2229.361773     1
15306.224930    1
8163.890428     1
4491.909095     1
Name: customer.lifetime.value, Length: 1906, dtype: int64
In [83]:
# loading the training data
train_data = pd.read_csv("train-1574429526318.csv")
train_data.head()
Out[83]:
CustomerID Customer.Lifetime.Value Coverage Education EmploymentStatus Gender Income Location.Geo Location.Code Marital.Status ... Months.Since.Policy.Inception Number.of.Open.Complaints Number.of.Policies Policy.Type Policy Renew.Offer.Type Sales.Channel Total.Claim.Amount Vehicle.Class Vehicle.Size
0 5917 7824.372789 Basic Bachelor Unemployed F 0 17.7,77.7 Urban Married ... 33 NaN 2.0 Personal Auto Personal L2 Offer2 Branch 267.214383 Four-Door Car 2.0
1 2057 8005.964669 Basic College Employed M 63357 28.8,76.6 Suburban Married ... 42 0.0 5.0 Personal Auto Personal L2 Offer2 Agent 565.508572 SUV 2.0
2 4119 8646.504109 Basic High School or Below Employed F 64125 21.6,88.4 Urban Married ... 44 0.0 3.0 Personal Auto Personal L1 Offer2 Branch 369.818708 SUV 1.0
3 1801 9294.088719 Basic College Employed M 67544 19,72.5 Suburban Married ... 15 NaN 3.0 Corporate Auto Corporate L3 Offer1 Branch 556.800000 SUV 3.0
4 9618 5595.971365 Basic Bachelor Retired F 19651 19.1,74.7 Suburban Married ... 68 0.0 5.0 Personal Auto Personal L1 Offer2 Web 345.600000 Two-Door Car 3.0

5 rows × 22 columns

In [84]:
# understanding the shape of the data
train_data.shape
Out[84]:
(9806, 22)
In [85]:
## let us understand the different levels of the train data
def understanding_data(data):
    return pd.DataFrame({"Data Type":data.dtypes,"No of Levels":data.apply(lambda x: x.nunique(),axis=0), "Levels":data.apply(lambda x: str(x.unique()),axis=0)})
understanding_data(train_data)
Out[85]:
Data Type No of Levels Levels
CustomerID int64 9806 [5917 2057 4119 ... 8767 9900 11323]
Customer.Lifetime.Value float64 6477 [7824.372789 8005.964669 8646.504109 ... 20496...
Coverage object 3 ['Basic' 'Extended' nan 'Premium']
Education object 5 ['Bachelor' 'College' 'High School or Below' '...
EmploymentStatus object 5 ['Unemployed' 'Employed' 'Retired' 'Medical Le...
Gender object 2 ['F' 'M' nan]
Income object 4622 ['0' '63357' '64125' ... '26173' '74665' '18017']
Location.Geo object 2840 ['17.7,77.7' '28.8,76.6' '21.6,88.4' ... '22.3...
Location.Code object 3 ['Urban' 'Suburban' 'Rural' nan]
Marital.Status object 3 ['Married' 'Divorced' 'Single' nan]
Monthly.Premium.Auto float64 191 [67.0 101.0 108.0 116.0 72.0 211.0 90.0 93.0 1...
Months.Since.Last.Claim int64 36 [2 26 3 30 14 10 6 17 13 4 8 16 29 24 28 27 1 ...
Months.Since.Policy.Inception int64 100 [33 42 44 15 68 13 76 50 62 41 21 51 61 90 35 ...
Number.of.Open.Complaints float64 6 [nan 0.0 1.0 2.0 3.0 5.0 4.0]
Number.of.Policies float64 9 [2.0 5.0 3.0 6.0 1.0 9.0 8.0 nan 7.0 4.0]
Policy.Type object 3 ['Personal Auto' 'Corporate Auto' 'Special Aut...
Policy object 9 ['Personal L2' 'Personal L1' 'Corporate L3' 'S...
Renew.Offer.Type object 4 ['Offer2' 'Offer1' 'Offer3' nan 'Offer4']
Sales.Channel object 4 ['Branch' 'Agent' 'Web' 'Call Center' nan]
Total.Claim.Amount float64 4125 [267.214383 565.508572 369.818708 ... 852.4603...
Vehicle.Class object 6 ['Four-Door Car' 'SUV' 'Two-Door Car' 'Luxury ...
Vehicle.Size float64 3 [2.0 1.0 3.0 nan]
In [86]:
#summary statistics of train data
train_data.describe()
Out[86]:
CustomerID Customer.Lifetime.Value Monthly.Premium.Auto Months.Since.Last.Claim Months.Since.Policy.Inception Number.of.Open.Complaints Number.of.Policies Total.Claim.Amount Vehicle.Size
count 9806.000000 9806.000000 9012.000000 9806.000000 9806.000000 8988.000000 9685.000000 9806.000000 9680.000000
mean 5778.381807 7998.047015 93.340657 15.143993 48.165001 0.379172 2.960351 438.266734 2.089773
std 3343.286093 6848.055899 34.417763 10.004327 27.963630 0.896427 2.389801 293.502301 0.538524
min 1.000000 1898.007675 61.000000 0.000000 0.000000 0.000000 1.000000 0.099007 1.000000
25% 2879.250000 4013.949039 68.750000 6.000000 24.000000 0.000000 1.000000 280.352767 2.000000
50% 5783.000000 5780.182197 83.000000 14.000000 48.000000 0.000000 2.000000 384.007015 2.000000
75% 8678.750000 8960.280213 109.000000 23.000000 71.750000 0.000000 4.000000 553.540973 2.000000
max 11573.000000 83325.381190 297.000000 35.000000 99.000000 5.000000 9.000000 2893.239678 3.000000
In [87]:
# pandas_profiling is assumed to be imported as pp (it is not in the import cell above)
pp.ProfileReport(train_data)
Out[87]:

In [88]:
import seaborn as sns  # the import cell above loaded seaborn without the sns alias

temp = train_data.corr()
fig = plt.figure(figsize=(10,8))
sns.heatmap(temp, annot=True)
Out[88]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e279dac2c8>
In [89]:
# frequency distribution of the data
train_data.hist(bins=50, figsize=(20,20))
plt.show()
In [90]:
# checking for missing data
train_data.isnull().sum()
Out[90]:
CustomerID                         0
Customer.Lifetime.Value            0
Coverage                         925
Education                        129
EmploymentStatus                 118
Gender                           129
Income                             0
Location.Geo                       0
Location.Code                    119
Marital.Status                   129
Monthly.Premium.Auto             794
Months.Since.Last.Claim            0
Months.Since.Policy.Inception      0
Number.of.Open.Complaints        818
Number.of.Policies               121
Policy.Type                      891
Policy                           121
Renew.Offer.Type                 128
Sales.Channel                    128
Total.Claim.Amount                 0
Vehicle.Class                    126
Vehicle.Size                     126
dtype: int64
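The raw counts above are easier to judge alongside the share of rows affected. A minimal sketch on a hypothetical two-column frame (not the actual CustomerData columns):

```python
import numpy as np
import pandas as pd

# Toy frame: report missing counts alongside the percentage of rows affected.
df = pd.DataFrame({"Coverage": ["Basic", None, "Premium", None],
                   "Income": [0.0, 63357.0, np.nan, 18017.0]})

missing = pd.DataFrame({
    "count": df.isnull().sum(),          # absolute number of NaNs per column
    "percent": 100 * df.isnull().mean(), # share of rows that are NaN
})
print(missing)
```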
In [91]:
#plot for missing data in matrix form
import missingno as msno  # not loaded in the import cell above
msno.matrix(train_data)
Out[91]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e26eb7df88>
In [92]:
train_data.dtypes
Out[92]:
CustomerID                         int64
Customer.Lifetime.Value          float64
Coverage                          object
Education                         object
EmploymentStatus                  object
Gender                            object
Income                            object
Location.Geo                      object
Location.Code                     object
Marital.Status                    object
Monthly.Premium.Auto             float64
Months.Since.Last.Claim            int64
Months.Since.Policy.Inception      int64
Number.of.Open.Complaints        float64
Number.of.Policies               float64
Policy.Type                       object
Policy                            object
Renew.Offer.Type                  object
Sales.Channel                     object
Total.Claim.Amount               float64
Vehicle.Class                     object
Vehicle.Size                     float64
dtype: object
In [93]:
#data type conversion
#NOTE: the lowercase names below do not match the actual (capitalised) column
#names, so columns.difference() removes nothing; the next cell corrects this.
num_cols = ['customer.lifetime.value','monthly.premium.auto','number.of.open.complaints','number.of.policies','total.claim.amount','vehicle.size']
cat_cols = train_data.columns.difference(num_cols)
cat_cols
Out[93]:
Index(['Coverage', 'Customer.Lifetime.Value', 'CustomerID', 'Education',
       'EmploymentStatus', 'Gender', 'Income', 'Location.Code', 'Location.Geo',
       'Marital.Status', 'Monthly.Premium.Auto', 'Months.Since.Last.Claim',
       'Months.Since.Policy.Inception', 'Number.of.Open.Complaints',
       'Number.of.Policies', 'Policy', 'Policy.Type', 'Renew.Offer.Type',
       'Sales.Channel', 'Total.Claim.Amount', 'Vehicle.Class', 'Vehicle.Size'],
      dtype='object')
In [94]:
#data type conversion (column names corrected to match the data)
num_cols = ['Customer.Lifetime.Value','Monthly.Premium.Auto','Number.of.Open.Complaints','Number.of.Policies','Total.Claim.Amount','Vehicle.Size']
cat_cols = train_data.columns.difference(num_cols)
cat_cols
Out[94]:
Index(['Coverage', 'CustomerID', 'Education', 'EmploymentStatus', 'Gender',
       'Income', 'Location.Code', 'Location.Geo', 'Marital.Status',
       'Months.Since.Last.Claim', 'Months.Since.Policy.Inception', 'Policy',
       'Policy.Type', 'Renew.Offer.Type', 'Sales.Channel', 'Vehicle.Class'],
      dtype='object')
In [99]:
num_data = train_data.loc[:,num_cols]
cat_data = train_data.loc[:,cat_cols]
In [100]:
train_data[cat_cols] = train_data[cat_cols].apply(lambda x: x.astype('category'))
train_data[num_cols] = train_data[num_cols].apply(lambda x: x.astype('float'))
train_data.dtypes
Out[100]:
CustomerID                       category
Customer.Lifetime.Value           float64
Coverage                         category
Education                        category
EmploymentStatus                 category
Gender                           category
Income                           category
Location.Geo                     category
Location.Code                    category
Marital.Status                   category
Monthly.Premium.Auto              float64
Months.Since.Last.Claim          category
Months.Since.Policy.Inception    category
Number.of.Open.Complaints         float64
Number.of.Policies                float64
Policy.Type                      category
Policy                           category
Renew.Offer.Type                 category
Sales.Channel                    category
Total.Claim.Amount                float64
Vehicle.Class                    category
Vehicle.Size                      float64
dtype: object
In [102]:
# Numeric columns imputation
imp = SimpleImputer(missing_values=np.nan, strategy='median')
num_data = pd.DataFrame(imp.fit_transform(num_data),columns=num_cols)

# Categorical columns imputation
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
cat_data = pd.DataFrame(imp.fit_transform(cat_data),columns=cat_cols)


print(num_data.isnull().sum())
print(cat_data.isnull().sum())
Customer.Lifetime.Value      0
Monthly.Premium.Auto         0
Number.of.Open.Complaints    0
Number.of.Policies           0
Total.Claim.Amount           0
Vehicle.Size                 0
dtype: int64
Coverage                         0
CustomerID                       0
Education                        0
EmploymentStatus                 0
Gender                           0
Income                           0
Location.Code                    0
Location.Geo                     0
Marital.Status                   0
Months.Since.Last.Claim          0
Months.Since.Policy.Inception    0
Policy                           0
Policy.Type                      0
Renew.Offer.Type                 0
Sales.Channel                    0
Vehicle.Class                    0
dtype: int64
In [103]:
# NOTE: this drop is effectively undone, because train_data is rebuilt from
# num_data/cat_data in the next cell; CustomerID should also be dropped from
# cat_data, otherwise it is one-hot encoded into thousands of dummy columns.
train_data.drop(['CustomerID'], axis=1, inplace=True)
In [104]:
standardizer = StandardScaler()
standardizer.fit(num_data)
num_data = pd.DataFrame(standardizer.transform(num_data),columns=num_cols)

train_data = pd.concat([num_data,cat_data],axis=1)
In [105]:
train_data = pd.get_dummies(train_data,columns=cat_cols,drop_first=True)
In [106]:
train_data.head()
Out[106]:
Customer.Lifetime.Value Monthly.Premium.Auto Number.of.Open.Complaints Number.of.Policies Total.Claim.Amount Vehicle.Size Coverage_Extended Coverage_Premium CustomerID_2 CustomerID_3 ... Renew.Offer.Type_Offer3 Renew.Offer.Type_Offer4 Sales.Channel_Branch Sales.Channel_Call Center Sales.Channel_Web Vehicle.Class_Luxury Car Vehicle.Class_Luxury SUV Vehicle.Class_SUV Vehicle.Class_Sports Car Vehicle.Class_Two-Door Car
0 -0.025362 -0.770181 -0.401989 -0.398990 -0.582827 -0.165606 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
1 0.001156 0.256591 -0.401989 0.862970 0.433551 -0.165606 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
2 0.094697 0.467986 -0.401989 0.021663 -0.233223 -2.034343 0 0 0 0 ... 0 0 1 0 0 0 0 1 0 0
3 0.189267 0.709579 -0.401989 0.021663 0.403879 1.703131 0 0 0 0 ... 0 0 1 0 0 0 0 1 0 0
4 -0.350785 -0.619185 -0.401989 0.862970 -0.315744 1.703131 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 1

5 rows × 17441 columns

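The 17,441 columns above come largely from one-hot encoding identifier-like columns such as CustomerID, Income, and Location.Geo. A minimal sketch, on hypothetical toy data, of keeping an ID column out of the dummy expansion:

```python
import pandas as pd

# Toy frame mimicking the structure: an ID column, a low-cardinality
# categorical, and a numeric feature.
df = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Coverage": ["Basic", "Extended", "Basic", "Premium"],
    "Monthly.Premium.Auto": [67.0, 101.0, 108.0, 72.0],
})

# Encode only genuine categoricals; keep the ID out of the dummy expansion.
id_cols = ["CustomerID"]
encoded = pd.get_dummies(df.drop(columns=id_cols),
                         columns=["Coverage"], drop_first=True)
print(encoded.shape)  # 4 rows; 1 numeric column + 2 dummy columns
```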
In [107]:
x = train_data.copy().drop("Customer.Lifetime.Value",axis=1)
y = train_data["Customer.Lifetime.Value"]
In [108]:
x_train, x_validation, y_train, y_validation = train_test_split(x, y, test_size=0.30,random_state=1)
In [109]:
print(train_data.shape)
print(x_train.shape)
print(y_train.shape)
(9806, 17441)
(6864, 17440)
(6864,)
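Although the stated objective is a linear regression model, the cells below fit a decision tree. For reference, a hedged sketch of the same split-fit-evaluate workflow with sklearn's LinearRegression, on synthetic stand-in data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for the preprocessed feature matrix and CLV target.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

# Same 70/30 split as the notebook uses for x_train / x_validation.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.30, random_state=1)

lr = LinearRegression().fit(X_tr, y_tr)
val_mae = mean_absolute_error(y_val, lr.predict(X_val))
print("Validation MAE:", val_mae)
```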
In [110]:
### Inspect the predictor columns (the target was already dropped from x)
## Print the last predictor and then all the remaining predictors
x.columns
print(x.columns.values[-1])
print(x.columns.values[:-1])
Vehicle.Class_Two-Door Car
['Monthly.Premium.Auto' 'Number.of.Open.Complaints' 'Number.of.Policies'
 ... 'Vehicle.Class_Luxury SUV' 'Vehicle.Class_SUV'
 'Vehicle.Class_Sports Car']
In [111]:
from sklearn.tree import DecisionTreeRegressor  # not in the import cell above

dtr = DecisionTreeRegressor()
dtr.fit(x_train,y_train)
Out[111]:
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')
In [112]:
## Get the predictions on train and test
pred_train = dtr.predict(x_train)
pred_val = dtr.predict(x_validation)
In [113]:
pred_train
Out[113]:
array([ 0.04162264, -0.28535495, -0.72792364, ..., -0.16582759,
       -0.33328813, -0.3647987 ])
In [114]:
pred_val
Out[114]:
array([-0.04138566,  0.6246319 ,  0.0722469 , ...,  0.38828108,
        0.51636558, -0.31904353])
In [115]:
from sklearn.metrics import mean_absolute_error  # not in the import cell above

dtr = DecisionTreeRegressor(max_depth=2)
dtr.fit(x_train,y_train)

pred_train = dtr.predict(x_train)

print("Train Error:", mean_absolute_error(y_train,pred_train))
Train Error: 0.4019101530090069
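The unrestricted tree above can memorise the training set, while max_depth=2 is likely to underfit. A small sketch, on synthetic data, of choosing max_depth by cross-validation instead of fixing it by hand:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the CLV data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

# 5-fold cross-validated search over candidate depths, scored by MAE.
grid = GridSearchCV(DecisionTreeRegressor(random_state=0),
                    param_grid={"max_depth": [2, 4, 6, 8]},
                    cv=5, scoring="neg_mean_absolute_error")
grid.fit(X, y)
print(grid.best_params_)
```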
In [117]:
test_df= pd.read_csv("test-1574429501088.csv")
In [118]:
test_df.head(3)
Out[118]:
CustomerID Coverage Education EmploymentStatus Gender Income Location.Geo Location.Code Marital.Status Monthly.Premium.Auto ... Months.Since.Policy.Inception Number.of.Open.Complaints Number.of.Policies Policy.Type Policy Renew.Offer.Type Sales.Channel Total.Claim.Amount Vehicle.Class Vehicle.Size
0 17 Basic Bachelor Employed M 43836.0 12.6,79.4 Rural Single 73.0 ... 44 0 1 Personal Auto Personal L1 Offer1 Agent 138.130879 Four-Door Car Medsize
1 19 Extended College Employed F 28812.0 17.3,78.4 Urban Married 93.0 ... 7 0 8 Special Auto Special L2 Offer2 Branch 425.527834 Four-Door Car Medsize
2 29 Premium Master Employed M 77026.0 18.4,73.5 Urban Married 110.0 ... 82 2 3 Corporate Auto Corporate L1 Offer2 Agent 472.029737 Four-Door Car Medsize

3 rows × 21 columns

In [124]:
test_df.describe()
Out[124]:
CustomerID Income Monthly.Premium.Auto Months.Since.Last.Claim Months.Since.Policy.Inception Number.of.Open.Complaints Number.of.Policies Total.Claim.Amount
count 1767.000000 1528.000000 1695.000000 1767.000000 1767.000000 1767.000000 1767.000000 1767.000000
mean 5834.826825 44606.390707 93.622419 15.022071 47.486701 0.413696 3.002264 423.389681
std 3328.701974 29046.821652 34.752238 10.202317 27.954860 0.955579 2.388154 289.518186
min 17.000000 0.000000 61.000000 0.000000 0.000000 0.000000 1.000000 1.332349
25% 2977.000000 23491.750000 69.000000 6.000000 24.000000 0.000000 1.000000 238.197494
50% 5813.000000 42821.000000 84.000000 14.000000 47.000000 0.000000 2.000000 381.118731
75% 8702.500000 67968.500000 110.000000 23.000000 71.000000 0.000000 4.000000 542.400000
max 11572.000000 99960.000000 298.000000 35.000000 99.000000 5.000000 9.000000 2759.794354
In [136]:
from sklearn.model_selection import GridSearchCV

def model_building(model, params=None, k=5):
    # Uses the x_train/x_validation split defined above.
    # (GridSearchCV requires cv >= 2, so the old default of k=1 would fail.)
    if params is None:
        model.fit(x_train, y_train)
        # return the fitted model & train/validation predictions
        return (model, model.predict(x_train), model.predict(x_validation))
    else:
        model_cv = GridSearchCV(model, param_grid=params, cv=k)
        model_cv.fit(x_train, y_train)
        model = model_cv.best_estimator_
        # return an extra object exposing all cross-validation results
        return (model_cv, model, model.predict(x_train), model.predict(x_validation))
# The libraries below were already loaded in the import cell above; only the
# fresh imputer/scaler objects for the test data are new here.
num_imputer = SimpleImputer()
num_scaler = StandardScaler()
In [140]:
# Fit the numeric transformers on the numeric test columns only; fitting on
# the full mixed-type frame raises "could not convert string to float".
X_actual_test = test_df.copy()
test_num_cols = X_actual_test.select_dtypes(include=np.number).columns
num_imputer.fit(X_actual_test[test_num_cols])
In [141]:
# Impute first, then fit/apply the scaler on the imputed values
# (scaling before imputation would fail on the remaining NaNs).
X_actual_test[test_num_cols] = num_imputer.transform(X_actual_test[test_num_cols])
X_actual_test[test_num_cols] = num_scaler.fit_transform(X_actual_test[test_num_cols])
In [142]:
# Missing values in the raw (untransformed) test data
test_df.isna().sum()
Out[142]:
CustomerID                         0
Coverage                           0
Education                          0
EmploymentStatus                   0
Gender                             0
Income                           239
Location.Geo                       0
Location.Code                      0
Marital.Status                     0
Monthly.Premium.Auto              72
Months.Since.Last.Claim            0
Months.Since.Policy.Inception      0
Number.of.Open.Complaints          0
Number.of.Policies                 0
Policy.Type                       42
Policy                             0
Renew.Offer.Type                   0
Sales.Channel                      0
Total.Claim.Amount                 0
Vehicle.Class                      0
Vehicle.Size                       0
dtype: int64
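A caveat on the cells above: the imputer and scaler should be fitted on the training data and only applied to the test data; fitting them on the test set leaks its own statistics into the preprocessing. A minimal sketch on hypothetical toy frames:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy train/test frames with a single numeric column containing NaNs.
train = pd.DataFrame({"Income": [40000.0, 60000.0, np.nan, 80000.0]})
test = pd.DataFrame({"Income": [np.nan, 50000.0]})

imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()

# Fit on train only, then reuse the fitted transformers on test:
# the test NaN is filled with the *train* median (60000).
train_t = scaler.fit_transform(imputer.fit_transform(train))
test_t = scaler.transform(imputer.transform(test))
print(test_t.ravel())
```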
In [143]:
# The fitted model above is `dtr` (no `clf` was ever defined). The test frame
# must also carry the same dummy-encoded columns as the training matrix, so
# align it against x.columns (assumes the numeric test columns were imputed above).
X_actual_test = pd.get_dummies(X_actual_test).reindex(columns=x.columns, fill_value=0)
actual_test_pred = dtr.predict(X_actual_test)
In [144]:
actual_test_pred
In [145]:
new_pred = pd.DataFrame(actual_test_pred)
In [147]:
sample= pd.read_csv('sample_submission-1577482703002.csv')
In [151]:
# Column names containing dots cannot be reached via attribute access;
# the submission column is assumed to be named 'Customer.Lifetime.Value'.
sample['Customer.Lifetime.Value'].head()
In [152]:
# Attribute assignment (sample.target = ...) does not create a DataFrame
# column; use item assignment on the assumed submission column instead.
sample['Customer.Lifetime.Value'] = actual_test_pred
In [ ]: